N87-26561 


SOLVING VERY LARGE, SPARSE LINEAR SYSTEMS ON MESH-CONNECTED 

PARALLEL COMPUTERS 

Torstein Opsahl* 

MRJ, Inc. 

Perkin-Elmer Advanced Development Center 
Oakton, Virginia 22124 


John Reif** 

Duke University 
Durham, North Carolina 27706 


ABSTRACT 

In this paper, the implementation of Pan and Reif's 
Parallel Nested Dissection algorithm on 
mesh-connected parallel computers is described. 
This is the first known algorithm that allows very 
large, sparse linear systems of equations to be 
solved efficiently in polylog time using a small 
number of processors. We describe how the 
processor bound of PND can be matched to the 
number of processors available on a given parallel 
computer by slowing down the algorithm by 
constant factors. Also, for the important class of 
problems where G(A) is a grid graph, we detail a 
unique memory mapping that reduces the 
inter-processor communication requirements of 
PND to those that can be executed on 
mesh-connected parallel machines. The paper 
concludes with a description of an implementation 
on the Goodyear Aerospace Massively Parallel 
Processor (MPP), located at NASA Goddard Space 
Flight Center, for which we give a detailed 
discussion of data mappings and performance 
issues. 
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INTRODUCTION 

The solution of large, sparse linear systems, Ax=b, 
pervades many areas of physics and engineering. 
Although parallel algorithms for tackling these 
problems have existed for a number of years, they 
have usually been impractical to implement because 
of unrealistically high time bounds or processor 
bounds, or both. In other cases, numerical stability 
has been a problem; that is, unless the calculations 
were performed in infinite precision they would yield 
no solution at all. However, none of the above 
problems apply to Pan and Reif's (Ref. 8) parallel 
nested dissection (PND) algorithm. This algorithm is 
based on computing a special recursive 
factorization of A, thus reducing the problem of 
inverting a large sparse matrix to that of inverting a 
number of much smaller dense matrices. PND has a 
considerably smaller processor bound than other 
polylog parallel methods for solving sparse linear 
systems. Furthermore, for any given parallel 
computer with p processors, the algorithm can be 
slowed down to give the best known time bound 
for this processor bound p. 

If A is an nxn matrix, we define a graph G(A)=(V,E) to 

be the undirected graph with vertex set V={1 n} 

and edge set E={{i,j}|Aj: not equal to 0}. Here we 
shall confine our attention to the application of PND 
to sparse, linear systems where G(A) is a 
two-dimensional (2-D) grid graph. Such systems 
occur very extensively in the solution of partial 
differential equations. However, it should be 
emphasised that the algorithm is not restricted to 
this class of problems. 
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The Massively Parallel Processor (MPP) is an SIMD 
controlled, fine-grained, 2-D mesh-connected 
parallel computer with 16,384 Processing Elements 
(PEs) built by Goodyear Aerospace for NASA. This 
paper will describe how PND can be implemented 
on this computer and other parallel machines with a 
similar architecture. The implementation is 
particularly challenging because of the SIMD control 
structure of these computers. Central to the 
implementation is a unique memory mapping that 
reduces the communication requirements of PND 
to those that can be executed using only mesh 
connections. Also, the algorithm has been 
implemented on the Connection Machine, built by 
Thinking Machines Corporation, whose processors 
are connected in a hypercube. This implementation 
is described in Ref. 5. 

In the next section, we give a more detailed 
description of PND, while the final section contains 
the implementation method for mesh-connected 
parallel computers together with some performance 
estimates. 


DESCRIPTION OF PND FOR GRID 
GRAPHS 

The method of parallel nested dissection (PND) and 
its proof has been described in detail elsewhere 
(Ref. 8). Here, we will give a relatively 
self-contained overview of the method. Since it is 
far from obvious how PND can be used on the MPP, 
the description concentrates on those aspects of 
the method that are critical to understanding our 
implementation for mesh-connected parallel 
computers. 

Fundamental to the nested dissection of 
undirected graphs is the idea of separators. A 
separator is a set of vertices that partitions a graph 
into two sub-graphs that are connected only 
through the separator. Each of the sub-graphs 
must contain no more than 2/3 and no less than 1/3 
of the nodes of G(A). For a graph of n vertices, the 
size of any separator is bound by a function s(n). In 
the case of planar graphs s(n) isO(n 1/2 ). An 
undirected graph is said to have an s(n)-separator 
family if the class of all its sub-graphs has an 
s(n)-separator family. Binary trees, for instance, 
have a 1 -separator family while a d-dimensional grid 
of uniform size in each dimension has n 1 "'™ 1 ) 
-separators. 


The removal of separators from a graph, G(A), and 
its resultant sub-graphs, is in fact analogous to 
eliminating the unknowns of the associated linear 
system, in the order given by the numbering of the 
vertices in the separators. Sequential methods for 
solving sparse, linear systems have long made use 
of nested dissection (Ref. 3) to reduce the problem 
to that of inverting dense matrices of size at most 
s(n)xs(n). PND is the first algorithm for efficiently 
solving sparse, positive definite linear systems, 
Ax=b, on parallel computers that only requires 
polylog time and s(n)^ processors. Moreover, the 
algorithm is numerically stable. PND constructs in 
O(logn) stages a special recursive factorization of A. 
This factorization is distinct from the 
LDL -factorization used in sequential dissection. 
Another important difference between sequential 
dissection and PND is that the latter requires the 
elimination of many, rather than one, separators at 
each step of the algorithm. 

Suppose G(A) has a separator that decomposes the 
graph into two sub-graphs. Furthermore, suppose 
that these sub-graphs can be partitioned recursively 
through the use of separators. In this way we can 
construct a separator tree (see figure 1) whose root 
is the separator of G(A). This root has two children 
that are the separators of the two sub-graphs. Each 
of these children is in turn the parent of two new 
children and so on. The leaves of the separator tree 
are singleton node sets of remaining sub-graphs. 

One of the general difficulties of PND is to compute 
efficiently in parallel s(n)-separators. Fortunately, for 
the practically important case of grid graphs, which 
we are concerned with here, there is a simple way to 
do this in O(logn) steps. At the completion of these 
steps the vertices of the graph have been 
renumbered from 1 to n using the separator tree as 
a guide. PND requires that A is permuted to reflect 
this renumbering. Assuming this has been done, 
Pan and Reif (Ref. 8) define a recursive 
s(n)-factorization by a sequence of matrices Aq, 
A^..., A d where, 
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Aq= PAP^ ; P is a permutation matrix 



Wl +Y h x h' lY h T ; <M 

There are several important points to note about 
equation (1). Let d be O(logn). Also, let N h bethe 
number of separators at level h and n h k (k= 1 ...N h , 
n hk <=f2/3) h ) the maximum size of any separator at 
the tv" level. It can then be seen that is a 


block-diagonal matrix consisting of square 
blocks of sizes at most n h k xn h k . Thus, X^' ' can 
be computed by inverting these small diagonal 
blocks separately. It also follows that the matrix 
product Y^X^ 1 Y h ' can be decomposed into 0(N h ) 
triplets of sizes 0(n h k xn h k ). Pan and Reif (Ref. 8) 
prove that both of the above operations can be 
performed in 0(log 2 s(n)) time using at most s(n) 3 
processors. In fact, the main advantage of the 
recursive s(n)-factorization is that it only requires 
O(Iogn) stages. This implies that the entire 
s(n)-factorization of A can be computed 
inO(logn(log 2 s(n)) time using s(n) 3 processors. 
However, on even the most parallel of currently 
available computers such as the MPP and the 
Connection Machine, this bound will severely 
restrict the size of problems that can be solved. 
Fortunately, the PND algorithm can easily be slowed 
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down so that the number of processors required 
matches that of a specific parallel machine. In 
particular, if s(n)=n 1/2 then n processors is 
sufficient provided the algorithm's speed is reduced 
by an n 1/2 factor. As a matter of fact, a careful time 
analysis shows that the total time to complete the 
s(n)-recursive factorization of A in this case is 
0(n 1/2 ). As we shall see in the next section this 
allows very large sparse matrices to be inverted on 
the MPP. 


Finally, the recursive s(n)-factorization allows us to 
write 



— (3) 

Thus, given an s(n)-factorization of A, it is easy to 
recursively compute A _1 b for any column vector b of 
length n. This "backsolving" computation can be 
performed in time O(lognlogs(n)) using 
s(n) processors. It should be noted that here we 
do not have to slow down the parallel computation 
when s(n)=n 1/2 and the number of available 
processors is n. 


In summary, a step in the recursive s(n)-factorization 
of A is accomplished by moving one level up in the 
separator tree removing the level h separators from 
the sub-graphs by eliminating the corresponding 
unknowns of Af,^. This involves 0(NjJ matrix 
operations on matrices of size 0(n^ k 2 ) (see 
above). If these operations are performed bv 
systolic algorithms, they require a maximum of s(n) 2 
processors and take 0(s(n)) time. Once the root of 
the tree is reached (N h =1, h=d-1), the 

s(n)-factorization is complete. Two traversals of the 
separator tree are then necessary to perform the 
"backsolving" because of the recursive nature of 
equation (3) and the total time for these 
computations is 0(lognlogs(n)). 


IMPLEMENTATION OF PND ON THE MPP 

In this section we will discuss an implementation of 
PND, for the case where G(A) is a n 1/2 xn 1/2 grid 
graph, for mesh-connected parallel computers, in 
general, and for the MPP in, particular. 

Representing a Grid Graph as a Set of 
Boxes 

To be able to fully explain our implementation of 
PND on the MPP, we first need to expand on some 
of the ideas of the previous section. Consider a 
phase h of the recursive factorization where X h is a 
block-diagonal matrix consisting of a total of N^ 
blocks x hk (k=1..N h ) of size at most n h k xn h k . 
Each of these diagonal blocks can be associated 
with a given separator. The same applies to the 

triplets y^kfxh^r^yh.k) 7 ( size °( n h,.k ). of 
Y^X^Y^ 1 . Furthermore, their existence implies 
that Z h can be broken up into N h blocks z h k also of 
size 0(n h k ). This in turn means that A^ can be 
considered as being made up of N^ smaller matrices 
a h k . Each of these will consist of a x h k , y^ k , 
(Yh k) ' and a z h k- The cruc ' a * observation, 
however, is that the’ m h k 's (m= a, x, y, z) are much 
smaller and denser than the big matrices to which 
they belong. Thus, a key element of the MPP 
implementation of PND is to compute A h+1 as 
N h+ i a h+ -j's. We will now describe in detail our 
implementation of the h™ stage of PND where 
A h+ i must be computed from A h . 

In figure 2, the grid graph of figure 1 has been 
extended so that each vertex is the junction of four 
edges, each of unit length. The resultant graph 
may be represented as a tiling of 2-D boxes with the 
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Figure 2 - 3x3 grid graph 


Figure 3 - Box representation of grid graph 


vertices lying along their perimeters ( see figure 3 ). 
Each box is a 2' 1 x2 h+1 rectangle or a 2 h x2" square 
depending on whether h is odd or even, 
respectively. Now, in computing the a-j k ’s (h= 0), 
the level zero separators are removed (’vertices 1, 

2, 3 and 4 for the grid graph of figure 3 ). This may 
be viewed as merging pairs of boxes into single , 
boxes. All vertices strictly internal to a box will be 
eliminated by the s(n)-factorization of A. The box 
representation of a unit edge length grid graph can 
be used at all levels in the separator tree and its 
traversal therefore translates into repeated 
box-compositions. A box b h k has 
2( n h k + n h . 1 k + 2) vertices on its perimeter. A 
matched pair of level h boxes is defined as two 
boxes that have a common separator s h k . The first 
vertex on either side of a separator will ’be referred 
to as contact points denoted by c 0 and c 1 , 
respectively. 

The arithmetic associated with the box-composition 
breaks down as follows: To compute the a h+ < k !s 
will require dense matrix inversions (or LDL ' 

decompositions), 2N| 1 matrix multiplications and 
matrix subtractions. These operations will be 
performed by systolic algorithms using sub-arrays of 
0((n h k r) processors ( see below for alternative 
non-systolic algorithms for the MPP). Because the 
two a h k 's of a pair of boxes are not completely 
disjoint ’due to the boxes' common edges it will also 
be necessary to merge pairs of a^ k 's by adding 
together the common edge coefficients in order to 
compute the z h k , y h k , x h k and y^ k needed in 


the calculation of a^ +1 k . This requires another Nu 
matrix additions. It should be noted that the final 
stage ( h=d-1 ) of the s(n)-factorization of A involves 
the inversion of a single matrix of size s(n)xs(n) and 
that the time required for this is 0(s(n)). Hence, the 
cost of the final stage is the dominant element in 
the total time for the factorization. 

The arithmetic described above will require 
inter-processor communications. The patterns of 
these, as well as their cost, depend on how A and 
its factors are mapped onto the local memories of 
the PEs. In the following a unique memory mapping 
scheme is described that creates local 
communication patterns that are supported by 
mesh-connected machines such as the MPP. This 
scheme also ensures that data can be sent to the 
correct processors using SIMD control so that no 
explicit address calculations are required. 

Mapping a Box Representation of a Grid 
Graph onto the MPP. 

Let each box of a box representation of a grid graph 
correspond to a (n^ k + n h-1 k + 2p neighborhood 
of processors such that the coefficients of adjacent 
boxes in the graph will be stored in adjacent 
neighborhoods of PEs. Within each of these 
neighborhoods, data is laid out using the following 
rule: The ordering of the coefficients should be that 
given by taking the vertices of the perimeter of the 
box in a clock-wise direction, starting in the lower left 
hand corner, just above the corner vertex. Figure 4 
shows the result of using our memory mapping 
scheme for the pair of boxes of figure 3 with 
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common edge (1, 7). On a mesh-connected 
computer with p processors, 

(p 1/2 /(n h k+rUj + i k + 2)) 2 boxes can be operated 
on in parallel. The actual location of each box within 
the processor array must be precomputed for all 
levels of the separator tree before the recursive 
decomposition of A can start. Our box-compositon 
strategy greatly simplifies this problem since it 
allows the processor in which each coefficient is to 
be stored to be determined based simply on the 
connectivity of the graph. 

Two additional ideas are essential to the successful 
use of the memory mapping scheme described 
above. First, it is helpful to think of the boxes as 
having an orientation: For even h, the separators 
are horizontal and the boxes lie along a north-south 
axis while for odd h the separators are vertical so 
that the boxes have an east-west orientation. 
Second, the merging of a pair of a h k 's ( see above ) 
is facilitated by dividing the vertices on the 
perimeters of the boxes into sequences that will 
contain different vertices depending on the 
position of a box in a pair. 

Let s h k be a separator at the h ,h level in the 
separator tree. Then, for a North (East) box let 
sequence a (sqa) be the (2nj,_.j k +n^ k +1) first 
vertices and sequence b (sqb) ’the c} , k (in 
reversed order) and c Q vertices. Also, for a South 
(west) box let sequence a (sqa) be the n h _i k first 
vertices, sequence b (sqb) the c 0 , s h k and c-| 
vertices and sequence c (sqc) the remaining 
vertices. The steps involved in merging pairs of 
a h k 's can now be described as: 

(i) For each north ( east ) box reverse the 
coefficients associated with the contact points 
about the data for the separators and move c 0 data 
to the processors containing this data for the south 
( west ) box 

(ii) For each north ( east ) box, reverse the order of 
the separator coefficients 

(iii) Insert the data for sqa and sqb ( minus the c 0 
coefficients ) for the north ( east ) box after the data 
for the sqa of the south ( west ) box: this 
automatically aligns the c-j and s^ k coefficients for 
the two boxes 

(iv) do matrix addition 


The data movements of (i), (ii) and (iii) translate into 
shifts through the mesh-connected network. The 
important points are that all shift distances are given 
by the dimensions of the boxes and that the data for 
common edges automatically line up. Having 
merged the pairs of a h 's, the computation of a h -j k 
becomes straightforward since all the required data 
is now stored in a local neighborhood of 
processors. 

"Backsolving" and Performance Issues 

PND reduces the problem of inverting (LDL^ 
decomposition) a large sparse matrix, A, to that of 
inverting a number of much smaller dense matrices 
by computing a recursive s(n)-factorization of A. 
Now, a choice has to be made as to which 
algorithms should be used for matrix inversion(or 
LDL t decomposition) and multiplication on the 
MPP. We have implemented both systolic and 
non-systolic algorithms for these operations. In the 
case of inversion, a comparison was made between 
a non-systolic Gauss-Jordan and a systolic Givens 
rotation method (Ref. 1) This comparison showed 
that although the latter required much less 
inter-processor communications than the former, 
this was off-set by the MPP's broadcasting facility 
and the smaller number of arithmetic operations of 
Gauss-Jordan. A similar result was obtained for the 
matrix multiplication algorithms. 

Finally, let us consider the "backsolving" part of 
PND. An inspection of equation (3) shows that in 
order to minimize the total amount of arithmetic 
required to solve a sparse, linear system using PND, 
some of the quantities computed during the 
factorization of A should be saved for the 
"backsolvjng" calculation. These include 
x h k 1 7h k ancl x h k" 1 • Since each processor of 
the MPP 'has only ik-bits of local memory, it would 
be impossible to store quantities where they are 
computed. However, the processors also have 
access to 16k-bits of staging memory each. This 
gives the MPP an impressive total amount of 
storage and makes possible the saving of quantities 
described above. The overhead of moving data 
between local processor memories and the staging 
memory would be less than that incurred if 
quantities had to be recomputed during the 
"backsolving" stage. 

The remaining issue is now how to perform the 
sparse multiplication A^b given our memory 
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mapping. There are two possible approaches - 
either to divide b into segments based on the 
vertices of the perimeters of a box or to store b as a 
single global data structure. We have chosen to 
implement the former strategy. Note that in 
principle, the sum-or tree of the MPP can be used 
to pipeline the solutions to many sparse linear 
systems, Ax=b, assuming A is the same for all of 
them. In particular, we can solve 0(s(n)) systems 
with distinct vectors b in total time 0(s(n)) using 
s(n)2 processors. 

With the implementation of PND outlined in this 
section, and given the storage constraints of the 
MPP, estimates show that linear systems with up to 
16k unknowns can be recursively factorized in 
about 40 seconds. The "backsolving" will take 
about 1 second. It should be noted that the size of 
the problems that can be tackled by our 
implementation of PND for mesh-connected parallel 
computers is only limited by available machine 
memory. 
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